Korean Patent KR20010011066A: Method for detecting end point of voice using adaptive codebook energy and adaptive codebook gain

Abstract:
PURPOSE: A method for detecting a speech segment using the energy and gain of an adaptive codebook is provided. It obtains energy information equivalent to that of fully decoded speech and then applies two-level threshold logic, which reduces decoding time, maintains the same recognition rate when a user uses the speech recognition function, and shortens the waiting time. CONSTITUTION: A per-frame energy with a value equivalent to the energy of fully decoded speech is calculated using the energy and gain of an adaptive codebook. A first threshold and a second threshold are set on the basis of the per-frame energy. The per-frame energy is compared with the first and second thresholds. According to the comparison results, a start point and an end point of the speech segment are detected, thereby detecting the speech segment.
Publication number: KR20010011066A
Application number: KR1019990030263
Filing date: 1999-07-24
Publication date: 2001-02-15
Inventors: 강명수; 김재원
Applicants: 조정남; 에스케이 텔레콤 주식회사
Main IPC class:

Description:

Speech segment detection method using energy and gain of adaptive codebook {METHOD FOR DETECTING END POINT OF VOICE USING ADAPTIVE CODEBOOK ENERGY AND ADAPTIVE CODEBOOK GAIN}
The present invention relates to a method for detecting a speech segment using the energy and gain of an adaptive codebook in a speech recognition system, and to a computer-readable recording medium on which a program for realizing the method is recorded.
FIG. 1 is an exemplary configuration diagram of a general speech recognition system.
As shown in FIG. 1, a general speech recognition system is composed of three parts, a speech segment detector 101, a feature extractor 102, and a recognizer 103.
Here, the speech segment detector 101 finds only the speech signal section in the input signal, the feature extractor 102 extracts the speech feature vectors required for recognition from that section, and the recognizer 103 recognizes speech from the speech feature vectors. The present invention relates to speech segment detection, or end point detection (EPD), in the speech segment detector 101.
In general, a speech segment is detected using energy values obtained from the speech waveform together with a zero crossing rate or a level crossing rate. However, to detect a speech segment using only vocoded packets, a different method must be chosen, because the terminal CPU does not have enough processing power to decode Enhanced Variable Rate Codec (EVRC) packets into a complete speech waveform in real time.
More specifically, Line Spectral Pair (LSP) decoding, adaptive codebook gain and delay decoding, fixed codebook gain and index decoding, and updating of the adaptive codebook memory together take only about 7% of the total decoding time.
On the other hand, converting the line spectral pairs into energy and synthesizing and filtering that energy take about 46% of the decoding time. Thus, a completely different approach is necessary to obtain information equivalent to the energy of fully decoded speech while still processing in real time.
The present invention, devised to meet the above requirements, aims to provide a speech segment detection method that detects a speech segment using adaptive codebook energy and adaptive codebook gain, and a computer-readable recording medium on which a program for realizing the method is recorded.
FIG. 1 is an exemplary configuration diagram of a general speech recognition system.
FIGS. 2A and 2B are flowcharts illustrating an embodiment of a method for detecting a speech segment using energy and gain of an adaptive codebook according to the present invention.
FIG. 3 is an explanatory diagram showing a result of detecting a start point and an end point according to an embodiment of the present invention.
* Explanation of symbols for the main parts of the drawings
101: speech segment detector
102: feature extractor
103: recognizer
To achieve the above object, the speech segment detection method in a speech recognition system according to the present invention comprises: a first step of obtaining a per-frame energy having a value similar to the energy of fully decoded speech, using the energy and gain of an adaptive codebook; a second step of setting first and second thresholds based on the per-frame energy; and a third step of detecting a speech section by comparing the per-frame energy with the first and second thresholds and detecting a start point and an end point of the speech section according to the comparison result.
In addition, the present invention provides a computer-readable recording medium on which is recorded a program for realizing, in a speech recognition system having a processor: a function of obtaining a per-frame energy having a value similar to that of fully decoded speech, using the energy and gain of an adaptive codebook; a function of setting first and second thresholds based on the per-frame energy; and a function of detecting a speech section by comparing the per-frame energy with the first and second thresholds and detecting a start point and an end point of the speech section according to the comparison result.
The present invention proposes a method that obtains a result similar to the energy of fully decoded speech using the energy and gain of the adaptive codebook. After the energy information of the speech is obtained by the proposed method, the speech section can be detected by applying two-level threshold logic.
The above objects, features and advantages will become more apparent from the following detailed description taken in conjunction with the accompanying drawings. Hereinafter, exemplary embodiments of the present invention will be described in detail with reference to the accompanying drawings.
FIGS. 2A and 2B are flowcharts illustrating a method of detecting a speech segment using energy and gain of an adaptive codebook according to the present invention.
Detecting the speech section broadly involves two stages: the speech energy is calculated, and then the start point and the end point of the speech section are detected from the speech energy information.
First, a process of calculating voice energy will be described in detail.
The energy of each frame of the adaptive codebook is obtained as follows. In the EVRC vocoder, the adaptive codebook gain and delay are processed per frame in three subframes of 53, 53, and 54 samples to generate the adaptive codebook. Accordingly, the adaptive codebook energy is also calculated per subframe, and the subframe values are then added together to obtain a value for one complete frame of 160 samples.
Therefore, the energy E of the final adaptive codebook in one frame unit is expressed by Equation 1 below.
Here, E_{T,m'}(n) is the adaptive codebook energy of the m'-th subframe, and the absolute value is used for the energy to minimize the amount of computation.
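Equation 1 itself is not reproduced in this text. A form consistent with the surrounding description (a sum of absolute values over the three subframes of 53, 53, and 54 samples) might look like the following. The index ranges and the symbol L_{m'} for the subframe length are assumptions, not taken from the original.

```latex
E = \sum_{m'=0}^{2} \; \sum_{n=0}^{L_{m'}-1} \left| E_{T,m'}(n) \right| ,
\qquad (L_0, L_1, L_2) = (53, 53, 54)
```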
Although taking the absolute value minimizes the amount of computation, the energy value obtained this way carries less information than the energy of fully decoded speech. For example, it has the disadvantage that the value is small for vowel sounds such as 'oh' and 'right'.
To compensate for this, the adaptive codebook gains are averaged over each frame, and that per-frame average is averaged again with the averages of the previous and next frames. The square of the resulting average adaptive codebook gain is then multiplied by the adaptive codebook energy of the current frame calculated in Equation 1. This is expressed as Equation 2 below.

Here, g_{p,m',f} is the adaptive codebook gain of the m'-th subframe in the f-th frame, and N denotes the current frame. This yields a value similar to the energy of fully decoded speech.
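Equation 2 is likewise missing from this text. One reading consistent with the description (average the three subframe gains within each frame, average that over frames N-1, N, and N+1, square the result, and multiply by the frame energy E from Equation 1) is sketched below. The equal weighting of the three frames and the name E_frame are assumptions.

```latex
E_{\mathrm{frame}}(N) =
\left( \frac{1}{3} \sum_{f=N-1}^{N+1} \; \frac{1}{3} \sum_{m'=0}^{2} g_{p,m',f} \right)^{2} E(N)
```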
Because the adaptive codebook gain changes in every subframe, its average is first taken over one frame, and that average is then averaged again with the values obtained for the previous frame and the next frame.
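The energy calculation described above can be illustrated with a minimal Python sketch. It assumes the adaptive codebook samples of one frame and the three subframe gains of the previous, current, and next frames are already available from the partially decoded EVRC packet. The function names and the equal weighting of the three frames are hypothetical choices for illustration, not part of the patent text.

```python
from typing import Sequence

# Subframe lengths stated in the text; they sum to one 160-sample frame.
SUBFRAME_SIZES = (53, 53, 54)


def adaptive_codebook_energy(excitation: Sequence[float]) -> float:
    """Sum of absolute values over the three subframes of one frame."""
    if len(excitation) != sum(SUBFRAME_SIZES):
        raise ValueError("expected one 160-sample frame")
    total = 0.0
    start = 0
    for size in SUBFRAME_SIZES:
        total += sum(abs(x) for x in excitation[start:start + size])
        start += size
    return total


def mean_gain(subframe_gains: Sequence[float]) -> float:
    """Average adaptive codebook gain over the subframes of one frame."""
    return sum(subframe_gains) / len(subframe_gains)


def frame_energy(
    excitation: Sequence[float],
    prev_gains: Sequence[float],
    cur_gains: Sequence[float],
    next_gains: Sequence[float],
) -> float:
    """Squared smoothed gain times the adaptive codebook energy.

    Assumption: previous, current, and next frames are weighted equally.
    """
    smoothed = (
        mean_gain(prev_gains) + mean_gain(cur_gains) + mean_gain(next_gains)
    ) / 3.0
    return smoothed ** 2 * adaptive_codebook_energy(excitation)
```

Because the gain of the next frame is needed, a sketch like this implies a one-frame look-ahead before the energy of the current frame can be finalized.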
Now, the process of detecting the start point and the end point of the speech section from the obtained speech energy information, and of taking the interval between them as the speech section, will be described in more detail.
First, to detect the start point, two thresholds are set based on the per-frame energy of the present invention: a first threshold and a second threshold.
Once the thresholds are set, the instant at which the energy exceeds the first threshold could be taken as the start point of the speech section. However, because some noise may also exceed the first threshold, such cases must be excluded. A frame is therefore regarded as the start point only when the energy stays above the first threshold for at least a predetermined number of start-length frames (NO_FRAME_START) and, during that period, the number of times the energy exceeds the second threshold is greater than another predetermined number of start-length frames (NO_FRAME_START2).
The end point of the speech section is detected in a manner similar to the start point. In contrast to the start point, the end point could be taken as the moment the energy falls below the first threshold.
In this case, however, the end point may be misjudged as detected after the first syllable ends and before the next syllable starts. Therefore, as in start point detection, the frame is regarded as the end point only if, after the energy falls below the first threshold, that state lasts longer than the number of end-point-length frames (NO_FRAME_END). That is, if another start point is detected after a first end point candidate appears but before the end-point-length number of frames has passed, the input is treated as one connected word, the detected end point is canceled, and the search for the end point condition begins again.
A further consideration is that a brief excursion above the first threshold caused by instantaneous noise during the end-point-length frames does not affect end point detection.
Finally, if the difference between the end point and the start point of the speech section is less than or equal to the predetermined minimum number of frames (MIN_DURATION) or more than the maximum number of frames (MAX_DURATION), the input is regarded as not being normal speech, and the speech is input again.
This will be described in detail according to the flow shown in the drawings.
As shown in FIGS. 2A and 2B, in the speech segment detection method using the energy and gain of the adaptive codebook according to the present invention, the first start length, the number of frames, the second start length, and the end point length are first initialized. The number of frames is then increased by one (202), and it is checked whether the energy is greater than the first threshold set based on the per-frame energy (203).
If the energy is not greater than the first threshold, the process repeats from step 202 of increasing the number of frames by one.
If the energy is greater than the first threshold, it is checked whether the energy is greater than the second threshold set based on the per-frame energy (204).
If the energy is not greater than the second threshold, the first start length is increased by one (206). If the energy is greater than the second threshold, the second start length is increased by one (205) and then the first start length is increased by one (206). It is then checked whether the first start length is greater than the predetermined number of start-length frames NO_FRAME_START (207).
If the first start length is not greater than the predetermined number of start-length frames, the process repeats from step 202. If it is greater, it is checked whether the second start length is greater than the other predetermined number of start-length frames NO_FRAME_START2 (208).
If the second start length is not greater than the other predetermined number of start-length frames, the process repeats from step 202. If it is greater, the frame is detected as the start point (209).
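The start point flow of steps 202 to 209 can be sketched in Python as follows. This is an illustrative reading of the flowchart text, not the patented implementation: the function and parameter names are hypothetical, frames are numbered from 1, and, following the text literally, the two start-length counters are not reset when the energy drops back below the first threshold.

```python
from typing import Optional, Sequence


def detect_start_point(
    energies: Sequence[float],
    thr1: float,
    thr2: float,
    no_frame_start: int,
    no_frame_start2: int,
) -> Optional[int]:
    """Return the 1-based frame number of the start point, or None."""
    first_len = 0   # frames whose energy exceeded the first threshold
    second_len = 0  # frames whose energy also exceeded the second threshold
    for frame_no, energy in enumerate(energies, start=1):  # step 202
        if energy <= thr1:                                  # step 203
            continue
        if energy > thr2:                                   # step 204
            second_len += 1                                 # step 205
        first_len += 1                                      # step 206
        # Steps 207 and 208: both counters must exceed their limits.
        if first_len > no_frame_start and second_len > no_frame_start2:
            return frame_no                                 # step 209
    return None
```

With NO_FRAME_START = 5 and NO_FRAME_START2 = 2, at least six frames above the first threshold, three of them also above the second threshold, are needed before a start point is reported.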
Once the start point of the speech section has been detected, the end point must be detected to complete detection of the speech section.
As shown in FIG. 2B, to detect the end point of the speech section, the number of frames is first increased by one (210), and it is checked whether the energy is smaller than the first threshold (211).
If the energy is not smaller than the first threshold, the process repeats from step 210. If it is smaller, the end point length is increased by one (212), and it is checked whether the end point length is greater than the predetermined number of end-point-length frames NO_FRAME_END (213).
If the end point length is not greater than the predetermined number of end-point-length frames, the number of frames is increased by one (214), and it is checked whether the energy is smaller than the first threshold (215).
If the energy is smaller than the first threshold, the first and second start lengths are initialized (216) and the end point length is increased by one (212).
If the energy is not smaller than the first threshold, the end point length and the first start length are each increased by one (217), and it is checked whether the energy is greater than the second threshold (218).
If the energy is greater than the second threshold, the second start length is increased by one (219), and it is then checked whether the first start length is greater than the predetermined number of start-length frames NO_FRAME_START (220).
If the energy is not greater than the second threshold, the process goes directly to checking whether the first start length is greater than the predetermined number of start-length frames NO_FRAME_START (220).
If the first start length is not greater than the predetermined number of start-length frames, the process repeats from step 214. If it is greater, it is checked whether the second start length is greater than the other predetermined number of start-length frames NO_FRAME_START2 (221).
If the second start length is not greater than the other predetermined number of start-length frames, the process repeats from step 214. If it is greater, the end point length and the first and second start lengths are initialized (222), and the process repeats from step 210.
If the check in step 213 finds that the end point length is greater than the predetermined number of end-point-length frames NO_FRAME_END, the frame at that time is detected as the end point (223).
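The end point flow of steps 210 to 223 can be sketched as a small state machine. Again this is only one reading of the flowchart text under stated assumptions: the names are hypothetical, frames are numbered from 1, scanning begins at the frame after the detected start point, and the check of step 213 is applied only after a frame whose energy is below the first threshold, as the text describes.

```python
from typing import Optional, Sequence


def detect_end_point(
    energies: Sequence[float],
    start_frame: int,
    thr1: float,
    thr2: float,
    no_frame_start: int,
    no_frame_start2: int,
    no_frame_end: int,
) -> Optional[int]:
    """Return the 1-based frame number of the end point, or None."""
    first_len = second_len = end_len = 0
    in_candidate = False  # True once the energy has dropped below thr1
    for frame_no in range(start_frame + 1, len(energies) + 1):  # 210 / 214
        energy = energies[frame_no - 1]
        if not in_candidate:
            if energy >= thr1:            # step 211: still speech
                continue
            in_candidate = True
            end_len += 1                  # step 212
        elif energy < thr1:               # step 215
            first_len = second_len = 0    # step 216
            end_len += 1                  # step 212
        else:
            end_len += 1                  # step 217
            first_len += 1
            if energy > thr2:             # step 218
                second_len += 1           # step 219
            # Steps 220 and 221: speech has resumed, cancel the candidate.
            if first_len > no_frame_start and second_len > no_frame_start2:
                first_len = second_len = end_len = 0  # step 222
                in_candidate = False
            continue                      # back to 214 (or 210 after a reset)
        if end_len > no_frame_end:        # step 213
            return frame_no               # step 223
    return None
```

In this reading a short burst above the first threshold keeps the end point counter running, while a burst long enough to satisfy the start conditions cancels the pending end point, which matches the connected-word behavior described earlier.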
Finally, it is checked whether the speech section, which is the difference between the detected end point and start point, is less than or equal to the predetermined minimum frame period or more than the maximum frame period (224). If so, the speech is input again (225). Otherwise the speech section has been detected, and processing proceeds to the next step of speech recognition.
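The final duration check of step 224 is small enough to show directly. In this sketch the function name is hypothetical, the default limits are the MIN_DURATION and MAX_DURATION values quoted for FIG. 3, and the boundary handling follows the wording of this translation literally (reject when the duration is at most the minimum or strictly more than the maximum), which is an assumption about the original intent.

```python
def is_valid_duration(
    start_frame: int,
    end_frame: int,
    min_duration: int = 18,
    max_duration: int = 50,
) -> bool:
    """Return True if the detected section length is accepted as speech."""
    duration = end_frame - start_frame
    # Rejected when duration <= min_duration or duration > max_duration.
    return min_duration < duration <= max_duration
```

A caller would prompt for the speech again (step 225) whenever this returns False.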
FIG. 3 is an explanatory diagram showing a result of detecting a start point and an end point according to an embodiment of the present invention.
As shown in FIG. 3, the end point of the speech section can be detected when the first and second thresholds are set and the parameters are set as follows: the predetermined number of start-length frames NO_FRAME_START is 5, the other predetermined number of start-length frames NO_FRAME_START2 is 2, the predetermined number of end-point-length frames NO_FRAME_END is 10, the predetermined minimum frame period MIN_DURATION is 18, and the maximum frame period MAX_DURATION is 50.
Although the technical idea of the present invention has been described in detail according to the above preferred embodiment, it should be noted that the above-described embodiment is for the purpose of description and not of limitation. In addition, those skilled in the art will understand that various embodiments are possible within the scope of the technical idea of the present invention.
As described above, the present invention detects a speech section by applying two-level threshold logic after obtaining energy information equivalent to that of fully decoded speech, and thereby also reduces decoding time. In addition, when a user uses the speech recognition function, the same recognition rate is guaranteed while the waiting time for recognition is reduced, which increases convenience, and the reduced amount of computation shortens the time needed to detect the speech section.

Claims:
Claims (8)
1. A speech segment detection method in a speech recognition system, using the energy and gain of an adaptive codebook, the method comprising:
A first step of obtaining a per-frame energy having a value similar to that of fully decoded speech, using the energy and gain of the adaptive codebook;
A second step of setting first and second thresholds based on the per-frame energy; and
A third step of detecting a speech section by comparing the per-frame energy with the first and second thresholds and detecting a start point and an end point of the speech section according to the comparison result.
2. The method of claim 1,
wherein the first step comprises:
A fourth step of obtaining the energy of the adaptive codebook corresponding to one frame by adding all the values calculated in each subframe unit;
A fifth step of taking an average of the adaptive codebook gains and taking a square root; and
A sixth step of multiplying the energy of the adaptive codebook by the gain to obtain a per-frame energy having a value similar to that of fully decoded speech.
3. The method of claim 2,
wherein, because the adaptive codebook gain changes in every subframe, an average value is obtained for one frame, and that average is averaged again with the values obtained for the previous frame and the next frame.
4. The method according to any one of claims 1 to 3,
wherein one frame substantially consists of 160 samples, being the sum of three subframes of 53, 53, and 54 samples.
5. The method according to any one of claims 1 to 3,
wherein detecting the start point of the speech section in the third step comprises:
A seventh step of initializing the first start length, the number of frames, the second start length, and the end point length;
An eighth step of increasing the number of frames by one;
A ninth step of proceeding to the eighth step if the per-frame energy is not greater than the first threshold;
A tenth step of increasing the second and first start lengths by one if the per-frame energy is greater than the first threshold and greater than the second threshold;
An eleventh step of, after performing the tenth step, proceeding to the eighth step if the first start length is not greater than the preset first start length frame number or the second start length is not greater than the preset second start length frame number; and
A twelfth step of, after performing the tenth step, detecting the frame as the start point if the first start length is greater than the first start length frame number and the second start length is greater than the second start length frame number.
6. The method according to any one of claims 1 to 3,
wherein detecting the end point of the speech section in the third step comprises:
A seventh step of initializing the first start length, the number of frames, the second start length, and the end point length;
An eighth step of increasing the number of frames by one;
A ninth step of proceeding to the eighth step if the per-frame energy is not less than the first threshold;
A tenth step of increasing the end point length by one if the per-frame energy is less than the first threshold;
An eleventh step of comparing the end point length with a preset number of end-point-length frames after performing the ninth step;
A twelfth step of increasing the number of frames by one if, as a result of the comparison in the eleventh step, the end point length is not greater than the end point length frame number;
A thirteenth step of, after performing the twelfth step, initializing the first and second start lengths and moving to the ninth step if the per-frame energy is less than the first threshold;
A fourteenth step of, after performing the twelfth step, increasing the end point length and the first start length by one if the per-frame energy is not less than the first threshold;
A fifteenth step of increasing the second start length by one if the per-frame energy is greater than the second threshold;
A sixteenth step of, after performing the fourteenth step, moving to the twelfth step if the first start length is not greater than a preset first start length frame number;
A seventeenth step of, after performing the fifteenth step, moving to the twelfth step if the second start length is not greater than a preset second start length frame number;
An eighteenth step of, after performing the sixteenth and seventeenth steps, initializing the end point length and the first and second start lengths and proceeding to the eighth step if the first start length is greater than the first start length frame number and the second start length is greater than the second start length frame number; and
A nineteenth step of detecting the frame as the end point if, as a result of the comparison in the eleventh step, the end point length is greater than the end point length frame number.
7. The method of claim 6,
further comprising, after the nineteenth step, a twentieth step of receiving the speech again if the difference between the end point and the start point is less than or equal to a predetermined minimum frame period or more than a maximum frame period.
8. A computer-readable recording medium having recorded thereon a program for realizing, in a speech recognition system having a processor:
A function of obtaining a per-frame energy having a value similar to that of fully decoded speech, using the energy and gain of an adaptive codebook;
A function of setting first and second thresholds based on the per-frame energy; and
A function of detecting a speech section by comparing the per-frame energy with the first and second thresholds and detecting a start point and an end point of the speech section according to the comparison result.

Legal status:
1999-07-24 | Application filed by 조정남, 에스케이 텔레콤 주식회사
1999-07-24 | Priority to KR1019990030263A
2001-02-15 | Publication of KR20010011066A

Priority:

Application number | Filing date | Title

KR1019990030263A | KR20010011066A | 1999-07-24 | 1999-07-24 | Method for detecting end point of voice using adaptive codebook energy and adaptive codebook gain